Identifying Semitic Roots: Machine Learning with Linguistic Constraints

Authors

  • Ezra Daya
  • Dan Roth
  • Shuly Wintner
Abstract

Words in Semitic languages are formed by combining two morphemes: a root and a pattern. The root consists of consonants only, by default three, and the pattern is a combination of vowels and consonants, with non-consecutive “slots” into which the root consonants are inserted. Identifying the root of a given word is an important task, considered to be an essential part of the morphological analysis of Semitic languages, and information on roots is important for linguistics research as well as for practical applications. We present a machine learning approach, augmented by limited linguistic knowledge, to the problem of identifying the roots of Semitic words. Although programs exist which can extract the root of words in Arabic and Hebrew, they are all dependent on labor-intensive construction of large-scale lexicons which are components of full-scale morphological analyzers. The advantage of our method is an automation of this process, avoiding the bottleneck of having to laboriously list the root and pattern of each lexeme in the language. To the best of our knowledge, this is the first application of machine learning to this problem, and one of the few attempts to directly address non-concatenative morphology using machine learning. More generally, our results shed light on the problem of combining classifiers under (linguistically motivated) constraints.
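The root-and-pattern combination described in the abstract can be illustrated with a minimal Python sketch. It is only an illustration of interdigitation, not the learning method the paper proposes. The function name `interdigitate`, the use of `C` as a slot marker, and the simplified Latin transliterations of the Arabic root k-t-b are assumptions made for this example.

```python
def interdigitate(root: str, pattern: str, slot: str = "C") -> str:
    """Insert the root consonants, in order, into the slots of a pattern.

    `root` is a string of consonants (three by default in Semitic
    languages); `pattern` mixes literal vowels and consonants with slot
    markers. The slot marker "C" is a convention assumed for this sketch.
    """
    if pattern.count(slot) != len(root):
        raise ValueError(
            f"pattern {pattern!r} has {pattern.count(slot)} slots, "
            f"but root {root!r} has {len(root)} consonants"
        )
    radicals = iter(root)
    # Replace each slot with the next root consonant; copy everything else.
    return "".join(next(radicals) if ch == slot else ch for ch in pattern)


# Simplified transliterations, for illustration only.
print(interdigitate("ktb", "CaCaCa"))  # kataba
print(interdigitate("ktb", "maCCaC"))  # maktab
```

Root identification is the inverse problem: given only the surface word, recover which of its consonants belong to the root. This is harder because a pattern may itself contribute consonants, such as the `m` in the second example.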

Related resources

Learning Hebrew Roots: Machine Learning with Linguistic Constraints

The morphology of Semitic languages is unique in the sense that the major word-formation mechanism is an inherently non-concatenative process of interdigitation, whereby two morphemes, a root and a pattern, are interwoven. Identifying the root of a given word in a Semitic language is an important task, in some cases a crucial part of morphological analysis. It is also a non-trivial task, which ...

Learning to Identify Semitic Roots

The standard account of word-formation processes in Semitic languages describes words as combinations of two morphemes: a root and a pattern. The root consists of consonants only, by default three (although longer roots are known), called radicals. The pattern is a combination of vowels and, possibly, consonants too, with ‘slots’ into which the root consonants can be inserted. Words are created...

Linguistic Constraints on Statistical Word Segmentation: The Role of Consonants in Arabic and English.

Statistical learning is often taken to lie at the heart of many cognitive tasks, including the acquisition of language. One particular task in which probabilistic models have achieved considerable success is the segmentation of speech into words. However, these models have mostly been tested against English data, and as a result little is known about how a statistical learning mechanism copes w...

Statistical Learning of Semitic Morphology Using Autosegmental Orthography

The root and pattern system, as well as the system of reduplication, are essential to the morphological analysis of Arabic words (McCarthy 1979, 1981). Few computational morphology systems have been designed to parse concatenative morphology, as well as roots and reduplication simultaneously, without the help of a dictionary. By using simple statistics, we show an algorithm that can le...

The Formation of Ethiopian Semitic Internal Reduplication

Semitic word formation has proved particularly contentious over the past several years with respect to the notion of the 'root' and 'template'. While traditional grammarians viewed Semitic words as consisting of roots and patterns (involved in paradigmatic relations), this approach has proved insufficient for certain kinds of word-formation that appear to require correspondence between words, s...

Journal:
  • Computational Linguistics

Volume 34

Publication date: 2008